Advanced models
1 Getting started
1.1 Required packages
library(highcharter) # For interactive plots
library(WDI) # For World Bank data
library(RcppRoll) # Fast rolling functions (C++-based)
library(openxlsx) # A cool package to deal with Excel files/formats
library(quantmod) # Package for financial data extraction
library(tsibble) # TS with dataframes framework
library(fable) # Package for time-series models & predictions
library(feasts) # Package for time-series analysis
library(forecast) # Another package for TS (forecasting)
library(tseries) # Simple package for time-series tests
library(plotly) # For 3D graphs
library(mvtnorm) # For multivariate Gaussian distributions
library(crypto2) # For crypto data
library(ggsci) # For color palettes
library(rugarch) # For GARCH models
library(fGarch) # For GARCH models
library(keras) # For neural networks
library(vars) # For VAR (Vector AutoRegressive) models
library(tidyverse) # THE library for data science; at the end to override other functions
set.seed(42) # Random seed
impute <- function(v, n = 6){ # Imputation function
for(j in 1:n){
ind <- which(is.na(v))
if(length(ind)>0){
if(ind[1]==1){ind <- ind[-1]}
v[ind] <- v[ind-1]
}
}
return(v)
}
1.2 Fetching some (crypto) data
coins <- crypto_list()
c_info <- crypto_info(coin_list = coins, limit = 30)
❯ Scraping crypto info
❯ Processing crypto info
coin_names <- c("Bitcoin", "Ethereum", "Tether USDt", "BNB", "USDC", "Solana")
coin_hist <- crypto_history(coins |> dplyr::filter(name %in% coin_names),
start_date = "20170101",
end_date = "20251206")
❯ Scraping historical crypto data
❯ Processing historical crypto data
Coin USDC does not have data available! Cont to next coin.
Coin Solana does not have data available! Cont to next coin.
coin_hist <- coin_hist |> # Timestamps are at midnight, hence we add a day.
dplyr::mutate(date = 1 + as.Date(as.POSIXct(timestamp, origin="1970-01-01")))
coin_hist <- coin_hist |>
group_by(name) |>
mutate(return = close / lag(close) - 1 ) |>
ungroup()
2 ARCH-type models
2.1 Realized volatility
Then, let’s have a look at volatility. It is measured as
\[v_t=\sqrt{\frac{1}{T-1}\sum_{s=1}^T(r_{t-s}-\bar{r})^2},\] where \(\bar{r}\) is the mean return in the sample of \(T\) observations. Below, we use an optimized, C++-based function to compute it as a rolling standard deviation over 20 days.
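Under the hood, a rolling standard deviation is nothing more than the following base-R loop (a slow reference sketch using a trailing window; {RcppRoll} is much faster, and its align argument controls where the window sits):

```r
roll_sd_base <- function(x, n = 20) {
  out <- rep(NA_real_, length(x))    # the first n-1 values have no full window
  if (length(x) >= n) {
    for (i in n:length(x)) {
      out[i] <- sd(x[(i - n + 1):i]) # sd over the trailing window of size n
    }
  }
  out
}
roll_sd_base(c(1, 2, 4, 7, 11, 16), n = 3)
```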
coin_hist <- coin_hist |>
group_by(name) |>
na.omit() |>
mutate(real_vol_20 = roll_sd(return, n = 20, fill = NA, na.rm = T))
coin_hist |>
filter(name == "Bitcoin") |>
pivot_longer(all_of(c("close", "real_vol_20")), names_to = "series", values_to = "value") |>
ggplot(aes(x = date, y = value, color = series)) + geom_line() + theme_bw() +
facet_wrap(vars(series), nrow = 2, scales = "free") # + scale_y_log10()
Warning: Removed 19 rows containing missing values or values outside the scale range
(`geom_line()`).
Now let’s have a look at the autocorrelation of the volatility: it’s highly persistent.
crypto_acf <- coin_hist |>
na.omit() %>%
split(.$name) %>%
map(~acf(.$real_vol_20, plot = F)) %>%
map_dfr(
~data.frame(lag = .$lag,
acf = .$acf,
ci = qnorm(0.975)/sqrt(.$n.used)
), .id = "name")
crypto_acf |>
filter(lag > 0, lag < 15) |>
ggplot(aes(x = lag, y = acf, fill = name)) +
geom_col(position = "dodge", alpha = 0.8) + theme_bw() +
scale_fill_viridis_d(option = "magma", end = 0.9, begin = 0.1)
Hence, in practice, simple autoregressive models can be too limited, because some variables are not exactly stationary.
The realized volatility is based on past returns. But markets (equities mostly) have another very cool, forward-looking indicator: the VIX. It is slightly different because it is computed from option prices (taking their implied volatilities over specific horizons). Still, it is empirically linked to the traditional volatility.
Let’s have a look below.
min_date <- "1998-01-01" # Starting date
max_date <- "2025-11-07" # Ending date
getSymbols("^VIX", src = 'yahoo', # The data comes from Yahoo Finance
from = min_date, # Start date
to = max_date, # End date
auto.assign = TRUE,
warnings = FALSE)
[1] "VIX"
VIX <- VIX |> as.data.frame() |>
rownames_to_column("date") |>
dplyr::mutate(date = as.Date(date))
head(VIX, 7) # Have a look at the result!
        date VIX.Open VIX.High VIX.Low VIX.Close VIX.Volume VIX.Adjusted
1 1998-01-02 24.34 24.93 23.42 23.42 0 23.42
2 1998-01-05 24.11 25.02 23.02 24.36 0 24.36
3 1998-01-06 25.20 25.97 24.87 25.66 0 25.66
4 1998-01-07 26.15 27.43 25.07 25.07 0 25.07
5 1998-01-08 26.24 26.70 25.62 26.01 0 26.01
6 1998-01-09 25.79 29.35 25.69 28.69 0 28.69
7 1998-01-12 28.69 31.08 28.02 28.02 0 28.02
vix_chart <- VIX |> dplyr::select(VIX.Close)
rownames(vix_chart) <- VIX |> dplyr::pull(date)
highchart(type = "stock") %>%
hc_title(text = "Evolution of the VIX") %>%
hc_add_series(as.xts(vix_chart)) %>%
hc_tooltip(pointFormat = '{point.y:.3f}')
We clearly see that the series exhibits some moments of extreme activity and that its distribution changes through time. This is referred to as “clusters of volatility”.
2.2 Some theory
Therefore, we also need to introduce models that allow for this type of feature. In 1982, Robert Engle proposed such a model, called ARCH for “AutoRegressive Conditional Heteroskedasticity”, which was generalized in 1986 to GARCH models. Engle obtained the Nobel Prize in economics in part for this contribution.
As we show below, GARCH models have a direct link with auto-regressive processes!
The idea is to work with the innovations \(e_t\) and to specify them a bit differently, namely as \(e_t=\sigma_t z_t\), where \(\sigma_t\) is a time-varying standard deviation and \(z_t\) is a simple white noise. What matters is that the volatility term is modelled as \[\sigma^2_t = a+\sum_{j=1}^pd_j\sigma^2_{t-j} + \sum_{k=1}^qb_ke^2_{t-k},\] hence the value of \(\sigma_t^2\) depends both on its own past and on the past of squared innovations.
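To see where these dynamics lead, here is a minimal base-R simulation of the \(p=q=1\) case (the parameter values are illustrative, not estimates): squared innovations become autocorrelated, which produces the clusters of volatility observed earlier.

```r
set.seed(42)
n <- 1000
a <- 1e-5; d <- 0.85; b <- 0.10      # a, d (on sigma^2), b (on e^2); d + b < 1
z <- rnorm(n)                        # white noise z_t
sigma2 <- numeric(n); e <- numeric(n)
sigma2[1] <- a / (1 - d - b)         # start at the unconditional variance
e[1] <- sqrt(sigma2[1]) * z[1]
for (t in 2:n) {
  sigma2[t] <- a + d * sigma2[t - 1] + b * e[t - 1]^2  # variance recursion
  e[t] <- sqrt(sigma2[t]) * z[t]                       # innovation e_t = sigma_t z_t
}
# Compare the lag-1 autocorrelation of e and of e^2:
c(acf(e, plot = FALSE)$acf[2], acf(e^2, plot = FALSE)$acf[2])
```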
There have been MANY extensions to these models: T-GARCH, E-GARCH, etc. See for instance GARCH models: structure, statistical inference and financial applications.
2.3 Examples: estimation
There are many packages for GARCH estimation. We propose two: {rugarch} and {fGarch}.
In {fGarch}, the baseline model is:
\[\sigma_t^2 = \omega + \alpha e_{t-1}^2 + \beta \sigma_{t-1}^2.\] In the output, there are some additional parameters that allow for a more general model:
- shape is the shape parameter of the conditional distribution.
- skew is the skewness parameter of the conditional distribution.
- delta is the exponent of the variance recursion, usually fixed at 2.
\(\delta\) values other than 2 come from P-GARCH models:
\[\begin{align} \varepsilon_t & = \sigma_t z_t, \qquad z_t \overset{\text{i.i.d.}}{\sim} \text{Skew}(\text{shape},\text{skew}), \\ \sigma_t^{\delta} &= \omega + \sum_{i=1}^{q} \alpha_i\bigl(|\varepsilon_{t-i}| - \gamma_i \varepsilon_{t-i}\bigr)^{\delta} + \sum_{j=1}^{p} \beta_j\,\sigma_{t-j}^{\delta}, \end{align}\]
The distribution is the Skewed Generalized Error Distribution, which allows for more flexibility than the simple Gaussian law. It nests many simpler distributions, like the Cauchy, Laplace, GED and Student laws for instance.
fit_f <- garchFit(formula = ~garch(1,1),
data = coin_hist |>
filter(name == "Bitcoin") |>
pull(return),
trace = F)
fit_f
Title:
GARCH Modelling
Call:
garchFit(formula = ~garch(1, 1), data = pull(filter(coin_hist,
name == "Bitcoin"), return), trace = F)
Mean and Variance Equation:
data ~ garch(1, 1)
<environment: 0x152dd6ac0>
[data = pull(filter(coin_hist, name == "Bitcoin"), return)]
Conditional Distribution:
norm
Coefficient(s):
mu omega alpha1 beta1
1.2790e-03 8.8107e-05 1.0231e-01 8.3473e-01
Std. Errors:
based on Hessian
Error Analysis:
Estimate Std. Error t value Pr(>|t|)
mu 1.279e-03 5.246e-04 2.438 0.0148 *
omega 8.811e-05 1.087e-05 8.108 4.44e-16 ***
alpha1 1.023e-01 1.184e-02 8.643 < 2e-16 ***
beta1 8.347e-01 1.577e-02 52.934 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Log Likelihood:
7086.565 normalized: 1.972874
Description:
Thu Dec 4 19:32:26 2025 by user:
Now, with rugarch…
spec <- ugarchspec()
fit_r <- ugarchfit(data = coin_hist |>
filter(name == "Bitcoin") |>
na.omit() |>
pull(return),
spec = spec)
fit_r@fit$coef
          mu          ar1          ma1        omega       alpha1 
1.271558e-03 -4.037899e-01 3.681241e-01 8.677615e-05 1.002666e-01
beta1
8.370735e-01
The results are not exactly the same! mu and omega are close, but there are two extra coefficients, ar1 and ma1.
The reason is that the default ugarchspec() adds an ARMA(1,1) model for the conditional mean on top of the GARCH(1,1) variance equation \[\sigma_t^2 = \omega + \alpha e_{t-1}^2 + \beta \sigma_{t-1}^2:\] ar1 and ma1 are mean-equation parameters, while alpha1 and beta1 correspond to \(\alpha\) and \(\beta\).
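To make the two packages directly comparable, the ARMA part of the mean equation can be switched off in the {rugarch} specification. A sketch (the fit call is left commented out because it needs the Bitcoin return series built above; btc_returns is a hypothetical name for that vector):

```r
library(rugarch)
spec0 <- ugarchspec(
  mean.model     = list(armaOrder = c(0, 0), include.mean = TRUE), # constant mean only
  variance.model = list(model = "sGARCH", garchOrder = c(1, 1))    # plain GARCH(1,1)
)
# fit0 <- ugarchfit(data = btc_returns, spec = spec0)  # same data as fit_f above
```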
With both libraries, it is possible to make predictions.
predict(fit_f, n.ahead = 5)
  meanForecast  meanError standardDeviation
1 0.001279035 0.02777110 0.02777110
2 0.001279035 0.02847428 0.02847428
3 0.001279035 0.02911777 0.02911777
4 0.001279035 0.02970811 0.02970811
5 0.001279035 0.03025082 0.03025082
ugarchforecast(fit_r, n.ahead = 3)
*------------------------------------*
* GARCH Model Forecast *
*------------------------------------*
Model: sGARCH
Horizon: 3
Roll Steps: 0
Out of Sample: 0
0-roll forecast [T0=1979-10-14]:
Series Sigma
T+1 0.0020578 0.02914
T+2 0.0009541 0.02971
T+3 0.0013998 0.03024
The values are quite close!
A cool feature of {rugarch} is that you can use forward rolling, as in actual backtests (more on that below).
3 Vector Auto-Regression
3.1 The VAR(1) model
Now we are switching to full multivariate mode with the Vector Auto-Regression (VAR). It is written as:
\[X_t = \mu + A X_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d. } (0,\Sigma_\varepsilon),\] with \(X_t \in \mathbb{R}^k, \mu \in \mathbb{R}^k, A \in \mathbb{R}^{k \times k}\).
In two dimensions: \(X_t=[X_{t,1} \ X_{t,2} ]'\).
\[X_t = \mu + \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} X_{t-1} + \text{error}\]
So \(A_{11}\) measures the impact of \(X_{t-1,1}\) on \(X_{t,1}\) (itself), and similarly \(A_{22}\) for the second component. The off-diagonal terms drive the cross-effects: \(A_{12}\) is the impact of \(X_{t-1,2}\) on \(X_{t,1}\), and vice-versa.
A few probabilistic properties.
If the VAR(1) is stable (all eigenvalues of \(A\) satisfy \(|\lambda_i(A)|<1\)), the unconditional mean exists and is given by \[\mathbb{E}[X_t] = (I - A)^{-1}\mu.\]
Next, let \(\Gamma_0 = \mathbb{V}(X_t)\); it must satisfy \[ \Gamma_0 = A \Gamma_0 A' + \Sigma_\varepsilon.\]
Next, define the lag-\(h\) autocovariance matrix: \[\Gamma_h = \text{Cov}(X_t, X_{t-h}) = \mathbb{E}[X_t X_{t-h}'].\]
Using for simplicity \(X_t = A X_{t-1} + \varepsilon_t\) (the constant terms do not affect the covariances): \[ \Gamma_h = \text{Cov}(A X_{t-1} + \varepsilon_t, X_{t-h}) = A \text{Cov}(X_{t-1}, X_{t-h}) + \underbrace{\text{Cov}(\varepsilon_t, X_{t-h})}_{=0} = A \Gamma_{h-1} \]
By recursion, we obtain the multivariate version of the AR(1): \[\Gamma_h = A^h \Gamma_0, \quad h \ge 0.\] Scaling by the variance, we get the autocorrelation: \(A^h\)!
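These formulas are easy to verify numerically in base R. Using the identity \(\text{vec}(\Gamma_0)=(I-A\otimes A)^{-1}\text{vec}(\Sigma_\varepsilon)\), a sketch with an arbitrary stable \(A\):

```r
A     <- matrix(c(0.5, 0.1, 0.2, 0.3), 2, 2)   # arbitrary stable coefficient matrix
Sigma <- diag(2)                               # innovation covariance
stopifnot(all(Mod(eigen(A)$values) < 1))       # stability: |lambda_i(A)| < 1
# Solve vec(Gamma0) = (A x A) vec(Gamma0) + vec(Sigma):
G0 <- matrix(solve(diag(4) - kronecker(A, A), as.vector(Sigma)), 2, 2)
max(abs(G0 - (A %*% G0 %*% t(A) + Sigma)))     # ~0: Gamma0 = A Gamma0 A' + Sigma holds
G2 <- A %*% A %*% G0                           # lag-2 autocovariance: Gamma_2 = A^2 Gamma0
```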
3.2 The VAR(p) Model
A Vector Autoregression of order \(p\), denoted VAR(\(p\)), models a \(k\)-dimensional time series vector \(\mathbf{X}_t = (X_{1,t}, \dots, X_{k,t})^\top\) as:
\[ \mathbf{X}_t = \mathbf{c} + A_1 \mathbf{X}_{t-1} + A_2 \mathbf{X}_{t-2} + \cdots + A_p \mathbf{X}_{t-p} + \mathbf{e}_t, \]
where:
- \(\mathbf{X}_t\) is a \(k \times 1\) vector of endogenous variables,
- \(\mathbf{c}\) is a \(k \times 1\) vector of intercept terms,
- \(A_i\) are \(k \times k\) coefficient matrices,
- \(\mathbf{e}_t\) is a \(k \times 1\) vector of innovations (reduced-form errors).
Error term assumptions: the innovations satisfy
\[\mathbf{e}_t \sim (0, \Sigma_e),\]
where \(\Sigma_e\) is a symmetric positive definite \(k \times k\) covariance matrix, allowing contemporaneous correlation across equations.
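A standard trick: any VAR(\(p\)) can be rewritten as a VAR(1) on the stacked vector \((\mathbf{X}_t', \dots, \mathbf{X}_{t-p+1}')'\) via the companion matrix, so stability can be checked on that matrix's eigenvalues. A base-R sketch for \(k=2\), \(p=2\) (the coefficients are illustrative):

```r
k <- 2
A1 <- matrix(c(0.4, 0.1, 0.0, 0.3), k, k)  # illustrative lag-1 coefficients
A2 <- matrix(c(0.2, 0.0, 0.1, 0.1), k, k)  # illustrative lag-2 coefficients
companion <- rbind(cbind(A1, A2),
                   cbind(diag(k), matrix(0, k, k)))  # [A1 A2; I 0]
max(Mod(eigen(companion)$values))          # < 1 => the VAR(2) is stable
```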
3.3 Empirics
VARs are often used on macroeconomic data. Below, we fetch a sample from the World Bank API.
wb_raw <- WDI( # World Bank data
indicator = c(
"labor" = "SL.TLF.TOTL.IN", # Labor force (# individuals)
"savings_rate" = "NY.GDS.TOTL.ZS", # Savings rate (% GDP)
"inflation" = "FP.CPI.TOTL.ZG", # Inflation rate
"trade" = "NE.TRD.GNFS.ZS", # Trade as % of GDP
"pop" = "SP.POP.TOTL", # Population
"pop_growth" = "SP.POP.GROW", # Population growth
"capital_formation" = "NE.GDI.TOTL.ZS", # Gross capital formation (% GDP)
"gdp_percap" = "NY.GDP.PCAP.CD", # GDP per capita
"RD_percap" = "GB.XPD.RSDV.GD.ZS", # R&D per capita
"educ_level" = "SE.SEC.CUAT.LO.ZS", # % pop. reaching secondary educ. level
"educ_spending" = "SE.XPD.TOTL.GD.ZS", # Education spending (%GDP)
"nb_researchers" = "SP.POP.SCIE.RD.P6", # Nb researchers per million inhab.
"debt" = "GC.DOD.TOTL.GD.ZS", # Central gov. debt (% of GDP)
"gdp" = "NY.GDP.MKTP.CD" # Gross Domestic Product (GDP)
),
extra = TRUE,
start = 1960,
end = 2024) |>
mutate(across(everything(), as.vector)) |>
select(-status, -lending, -iso2c, -iso3c) |>
filter(region != "Aggregates", income != "Aggregates") |>
arrange(country, year) |>
group_by(country) |>
mutate(across(everything(), impute)) |>
mutate(gdp_growth = gdp_percap/dplyr::lag(gdp_percap) - 1, .before = "region") |>
ungroup() |>
filter(lastupdated == max(lastupdated)) |>
arrange(country, year)
Ok, let’s focus on the US for simplicity. Taking other countries into account would require a panel VAR.
To lighten the output (which can be heavy), we only focus on 3 variables.
wb_us <- wb_raw |>
filter(country == "United States") |>
select(gdp_growth, inflation, pop_growth) |>
na.omit()
fit_VAR <- VAR(wb_us, p = 1)
fit_VAR |> summary()
VAR Estimation Results:
=========================
Endogenous variables: gdp_growth, inflation, pop_growth
Deterministic variables: const
Sample size: 63
Log Likelihood: 97.776
Roots of the characteristic polynomial:
0.845 0.75 0.2559
Call:
VAR(y = wb_us, p = 1)
Estimation results for equation gdp_growth:
===========================================
gdp_growth = gdp_growth.l1 + inflation.l1 + pop_growth.l1 + const
Estimate Std. Error t value Pr(>|t|)
gdp_growth.l1 0.422576 0.141282 2.991 0.00405 **
inflation.l1 0.001643 0.001513 1.086 0.28192
pop_growth.l1 0.004779 0.012570 0.380 0.70517
const 0.020769 0.014248 1.458 0.15025
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0253 on 59 degrees of freedom
Multiple R-Squared: 0.2914, Adjusted R-squared: 0.2554
F-statistic: 8.087 on 3 and 59 DF, p-value: 0.0001344
Estimation results for equation inflation:
==========================================
inflation = gdp_growth.l1 + inflation.l1 + pop_growth.l1 + const
Estimate Std. Error t value Pr(>|t|)
gdp_growth.l1 30.60299 8.81182 3.473 0.00097 ***
inflation.l1 0.56629 0.09439 6.000 1.29e-07 ***
pop_growth.l1 -0.83768 0.78398 -1.068 0.28965
const 0.84518 0.88868 0.951 0.34546
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.578 on 59 degrees of freedom
Multiple R-Squared: 0.6864, Adjusted R-squared: 0.6705
F-statistic: 43.05 on 3 and 59 DF, p-value: 7.201e-15
Estimation results for equation pop_growth:
===========================================
pop_growth = gdp_growth.l1 + inflation.l1 + pop_growth.l1 + const
Estimate Std. Error t value Pr(>|t|)
gdp_growth.l1 0.838545 0.535011 1.567 0.122
inflation.l1 0.005373 0.005731 0.938 0.352
pop_growth.l1 0.862075 0.047600 18.111 <2e-16 ***
const 0.059493 0.053956 1.103 0.275
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.09582 on 59 degrees of freedom
Multiple R-Squared: 0.8503, Adjusted R-squared: 0.8427
F-statistic: 111.7 on 3 and 59 DF, p-value: < 2.2e-16
Covariance matrix of residuals:
gdp_growth inflation pop_growth
gdp_growth 0.0006403 0.01862 -0.0004703
inflation 0.0186233 2.49093 -0.0238060
pop_growth -0.0004703 -0.02381 0.0091824
Correlation matrix of residuals:
gdp_growth inflation pop_growth
gdp_growth 1.0000 0.4663 -0.1939
inflation 0.4663 1.0000 -0.1574
pop_growth -0.1939 -0.1574 1.0000
Then, onto forecasting!
predict(fit_VAR, n.ahead = 3)
$gdp_growth
fcst lower upper CI
[1,] 0.04827954 -0.001316983 0.09787606 0.04959652
[2,] 0.05065569 -0.004265562 0.10557695 0.05492126
[3,] 0.05195311 -0.004846277 0.10875250 0.05679939
$inflation
fcst lower upper CI
[1,] 3.000879 -0.09247298 6.094230 3.093352
[2,] 3.223890 -0.97430754 7.422087 4.198197
[3,] 3.435725 -1.34243374 8.213883 4.778159
$pop_growth
fcst lower upper CI
[1,] 0.9528018 0.7649883 1.140615 0.1878135
[2,] 0.9374872 0.6911543 1.183820 0.2463330
[3,] 0.9274756 0.6415400 1.213411 0.2859356
Another potential package/function is {dynlm}, with documentation and tutorial for VAR.
4 A primer on prediction
4.1 Training/testing & co
The key idea is backtesting: you have to put yourself in the shoes of someone living in the past, that is, without knowledge of the future. To do this, you need a loop that spans different dates in the past. At each point in time \(t\), you:
- extract the data available at time \(t\);
- build a predictive model;
- make some prediction based on the currently available data;
- evaluate the accuracy of the prediction thanks to the (future) realized values.
A simplified diagram below.
4.2 Some code
Even though it’s ugly, we will resort to loops.
Below, we focus on the simple case of a unique series (bitcoin) and a simple model (ARMA), but this can be applied to much more complex situations and models.
In the code below, we
- build predictions;
- derive the corresponding errors;
- implement a trading strategy (we buy if returns are predicted to be positive and short sell otherwise).
btc <- coin_hist |> filter(name == "Bitcoin", wday(date) == 2) |>
ungroup() |> dplyr::select(date, close) |>
mutate(fwd_return = lead(close) / close - 1) |> na.omit()
pred <- 0 # To store the predictions
err <- 0 # To store the error
strat_ret <- 0 # Return of the strategy
buffer <- 100 # Time required to start learning
tc <- 0.001 # Transaction costs
trading_dates <- btc$date[buffer:nrow(btc)]
for(j in 1:(nrow(btc)-buffer+1)){
if(j%%100 == 0){print(trading_dates[j])}
train <- btc |> filter(date < trading_dates[j]) # expanding windows
fit <- arima(train$fwd_return, order = c(1,0,1), include.mean = T)
pred[j] <- predict(fit, n.ahead = 1)$pred[1] |> as.numeric()
err[j] <- btc |> filter(date == trading_dates[j]) |> pull(fwd_return) - pred[j]
strat_ret[j] <- (btc |> filter(date == trading_dates[j]) |> pull(fwd_return)) * sign(pred[j])
if(j>1){strat_ret[j] <- strat_ret[j] - tc * abs(sign(pred[j])-sign(pred[j-1])) } # Costs
}
[1] "2020-10-26"
[1] "2022-09-26"
[1] "2024-08-26"
[1] "2025-08-18"
Let’s analyze the outcome. First, the errors…
sd(btc$fwd_return)
[1] 0.09755115
sd(err)
[1] 0.08771966
At least errors have lower variance than the returns themselves.
Next, onto the strategy.
mean(strat_ret > 0) # This is the hit ratio = accuracy
[1] 0.5157385
mean(strat_ret) # The average weekly return
[1] 0.007221179
sd(strat_ret) # The volatility
[1] 0.08760782
(mean(strat_ret)*52 - 0.03)/sqrt(52) / sd(strat_ret) # The Sharpe ratio
[1] 0.5468966
Let’s visualize this…
data.frame(date = trading_dates, return = strat_ret) |>
mutate(portfolio_value = cumprod(1+return)) |>
ggplot(aes(x = date, y = portfolio_value)) + geom_line() +
theme_classic() + theme(axis.title = element_blank())
NOTE: in practice, trading costs are super important.
We have assumed that it’s possible to short the BTC: not so obvious, but short ETFs do exist.
4.3 What really matters (in finance)
Models (and people) often focus on loss (objective) functions.
In finance, we usually try to forecast returns, hence the loss function is the (root-) Mean Squared Error:
\[\text{MSE} = \frac{1}{T} \sum_{t=1}^T (r_t-\tilde{r}_t)^2, \]
where \(r_t\) is the realized return and \(\tilde{r}_t\) is the predicted return. Assuming predictions are unbiased \((\mathbb{E}[r_t]=\mathbb{E}[\tilde{r}_t])\), then (after some math…)
\[\text{MSE} = \mathbb{E}[(r_t - \tilde{r}_t)^2]=\mathbb{V}[r_t]+\mathbb{V}[\tilde{r}_t]-2\text{Cov}(r_t, \tilde{r}_t)\]
Question: which of these terms matter the most?
- the variance of realized returns: nothing we can do about that.
- the variance of predicted returns: this enters the variance-bias tradeoff, but its impact is not obvious a priori.
- the correlation between predicted and realized returns: this is the CRUX!!! This measures the accuracy (very similar to the hit ratio!)
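The decomposition above is easy to check on simulated numbers. A sketch with population-style moments (denominators of \(T\)); the predictions are re-centred so that they are unbiased:

```r
set.seed(42)
n_obs   <- 1e4
r       <- rnorm(n_obs, mean = 0.001, sd = 0.05)        # "realized" returns
r_tilde <- 0.3 * r + rnorm(n_obs, sd = 0.04)            # correlated "predictions"
r_tilde <- r_tilde - mean(r_tilde) + mean(r)            # enforce unbiasedness
v  <- function(x) mean((x - mean(x))^2)                       # variance, denominator T
cv <- function(x, y) mean((x - mean(x)) * (y - mean(y)))      # covariance, denominator T
mse        <- mean((r - r_tilde)^2)
mse_decomp <- v(r) + v(r_tilde) - 2 * cv(r, r_tilde)
all.equal(mse, mse_decomp)                              # TRUE: the identity holds
```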
5 Advanced temporal models
Before we can move towards sequence-based neural networks, we must explain how simple, feed-forward NNs work.
5.1 (Feedforward) neural networks
A few images help explain how simple NNs work.
First, the structure. The network takes an input which is then processed via many multiplications, additions and activations.
At the very end, we get the output. It can be a simple number, but could also be more complex (vector, matrix, etc.).
The intermediate activation functions are there to break the linearity. A few examples below:
Next, the training. This phase is intended to determine “good” parameter values. Obviously, the “best” parameters would be ideal, but they are out of reach (the problem is too complex).
The core idea is gradient descent: we seek to minimize a loss function (e.g., the MSE) and this depends on the derivative:
Now, there is a very interesting process called backpropagation that involves lots of computations of derivatives (skipped here):
In the end, the parameters (weights) of the model are locally optimal.
But they depend on the training sample: whether they work well out-of-sample (on testing sets) is another question.
If yes, the model is said to generalize well. That’s the great quest of machine learning.
An important property of NNs is that they are universal approximators: Any continuous function \(f\) can be approximated up to arbitrary precision (on a finite interval) by a single layer perceptron with smooth (bounded continuous) activation function.
How come? You just need to add units to help the perceptron overfit!
5.2 Recurrent networks
Feed-forward networks have zero memory. How can we bring some into the model?
Problem: vanishing gradients… Hence the need for longer-term memory: LSTM or GRU cells…
The system of equations is the following \[\begin{align*} \tilde{y}_i&=z_i\tilde{y}_{i-1}+ (1-z_i)\tanh \left(\textbf{w}_y'\textbf{x}_i+ b_y+ u_yr_i\tilde{y}_{i-1}\right) \quad \textcolor{red}{\text{output}} \\ z_i &= \text{sig}(\textbf{w}_z'\textbf{x}_i+b_z+u_z\tilde{y}_{i-1}) \quad \textcolor{red}{\text{`update gate'}} \ \in (0,1)\\ r_i &= \text{sig}(\textbf{w}_r'\textbf{x}_i+b_r+u_r\tilde{y}_{i-1}) \quad \textcolor{red}{\text{`reset gate'}} \ \in (0,1) \end{align*}\]
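To demystify the system above, here is a single GRU update written in base R for a scalar input (the weights are arbitrary illustrative numbers, not trained values; sig is the logistic function, available as plogis):

```r
sig <- plogis                       # sigmoid: 1 / (1 + exp(-x))
# Illustrative scalar weights (in practice, learned by gradient descent):
w_z <- 0.5; b_z <- 0; u_z <- 0.8    # update gate parameters
w_r <- 0.3; b_r <- 0; u_r <- 0.8    # reset gate parameters
w_y <- 1.0; b_y <- 0; u_y <- 0.8    # output parameters
gru_step <- function(x, y_prev) {
  z <- sig(w_z * x + b_z + u_z * y_prev)   # update gate, in (0,1)
  r <- sig(w_r * x + b_r + u_r * y_prev)   # reset gate, in (0,1)
  z * y_prev + (1 - z) * tanh(w_y * x + b_y + u_y * r * y_prev)  # new state
}
y <- 0                                     # initial state
for (x in c(0.2, -0.1, 0.4)) y <- gru_step(x, y)  # feed a short input sequence
y                                          # stays in (-1, 1) by construction
```

Each step mixes the previous state and a fresh candidate, with the update gate \(z\) deciding how much memory to keep.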
Preparing the data should follow this structure:
Sequence 1: [price_day1, price_day2, …, price_day10] → target: price_day11
Sequence 2: [price_day2, price_day3, …, price_day11] → target: price_day12
Sequence 3: [price_day3, price_day4, …, price_day12] → target: price_day13
5.3 Example
First, let’s fetch some new data! And compute returns.
getSymbols("NVDA", from = "2010-01-01")
[1] "NVDA"
prices <- Cl(NVDA) # Closing price
nvda <- prices / lag(prices) - 1 # Compute return
nvda <- nvda[-1] # Removing missing first element
Next, let’s prepare and format the inputs/outputs.
n <- length(nvda)
train_size <- floor(0.8 * n)
look_back <- 20
# Create sequences
X <- matrix(NA, nrow = n - look_back, ncol = look_back)
y <- numeric(n - look_back)
for (i in 1:(n - look_back)) {
X[i, ] <- nvda[i:(i + look_back - 1)]
y[i] <- nvda[i + look_back]
}
head(X[,1:5])
         [,1]         [,2]         [,3]        [,4]        [,5]
[1,] 0.014602526 0.006396570 -0.019597489 0.002161031 -0.01401618
[2,] 0.006396570 -0.019597489 0.002161031 -0.014016185 -0.03389832
[3,] -0.019597489 0.002161031 -0.014016185 -0.033898325 0.01358237
[4,] 0.002161031 -0.014016185 -0.033898325 0.013582371 -0.01563372
[5,] -0.014016185 -0.033898325 0.013582371 -0.015633723 -0.02949520
[6,] -0.033898325 0.013582371 -0.015633723 -0.029495204 0.01870255
Check the structure of features.
Final prep: splitting & shaping
X_train <- X[1:train_size, ]
y_train <- y[1:train_size]
X_test <- X[(train_size + 1):nrow(X), ]
y_test <- y[(train_size + 1):length(y)]
# Reshape for GRU [samples, timesteps, features]
X_train <- array(X_train, dim = c(nrow(X_train), ncol(X_train), 1))
X_test <- array(X_test, dim = c(nrow(X_test), ncol(X_test), 1))
Now: onto building the (simple) model.
model <- keras_model_sequential() %>%
layer_gru(units = 16, input_shape = c(look_back, 1), return_sequences = FALSE) %>%
layer_dense(units = 1, activation = "tanh")
model %>% compile(optimizer = optimizer_adam(learning_rate = 0.001), loss = "mse")
Training…
fit_ <- model %>% fit(X_train, y_train, epochs = 60, batch_size = 32, verbose = 0)
plot(fit_)
We indeed see some improvement on the training set. It slows down after ~20 epochs.
Predict & evaluate.
y_pred <- model %>% predict(X_test, verbose = 0)
mean(abs(y_test - y_pred)) # MAE
[1] 0.02314833
mean(y_test * y_pred > 0) # Hit ratio
[1] 0.5441741
cor(y_test, y_pred)
           [,1]
[1,] 0.07798228
data.frame(pred = y_pred, actual = y_test, t = 1:length(y_test)) |>
pivot_longer(-t, values_to = "value", names_to = "type") |>
ggplot(aes(x = t, y = value, color = type)) + geom_line() +
theme_minimal()
Clearly, the dispersion of the predictions is flawed. But the correlation seems positive… which is what matters!
5.4 Code for universal approximation
A first model.
model_ua <- keras_model_sequential()
model_ua %>% # This defines the structure of the network, i.e. how layers are organized
layer_dense(units = 16, activation = 'sigmoid', input_shape = 1) %>%
layer_dense(units = 1) #
model_ua %>% keras::compile( # Model specification
loss = 'mean_squared_error', # Loss function
optimizer = optimizer_rmsprop(), # Optimisation method (weight updating)
metrics = c('mean_absolute_error') # Output metric
)
summary(model_ua)
Model: "sequential_1"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
dense_2 (Dense) (None, 16) 32
dense_1 (Dense) (None, 1) 17
================================================================================
Total params: 49 (196.00 Byte)
Trainable params: 49 (196.00 Byte)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
fit_ua <- model_ua %>%
fit(seq(0, 6, by = 0.001) %>% matrix(ncol = 1), # Training data = x
sin(seq(0, 6, by = 0.001)) %>% matrix(ncol = 1), # Training label = y
verbose = 0,
epochs = 30, batch_size = 64 # Training parameters
)
A second model.
model_ua2 <- keras_model_sequential()
model_ua2 %>% # This defines the structure of the network, i.e. how layers are organized
layer_dense(units = 128, activation = 'sigmoid', input_shape = 1) %>%
layer_dense(units = 1) #
model_ua2 %>% keras::compile( # Model specification
loss = 'mean_squared_error', # Loss function
optimizer = optimizer_rmsprop(), # Optimisation method (weight updating)
metrics = c('mean_absolute_error') # Output metric
)
summary(model_ua2)
Model: "sequential_2"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
dense_4 (Dense) (None, 128) 256
dense_3 (Dense) (None, 1) 129
================================================================================
Total params: 385 (1.50 KB)
Trainable params: 385 (1.50 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
fit_ua2 <- model_ua2 %>%
fit(seq(0, 6, by = 0.0002) %>% matrix(ncol = 1), # Training data = x
sin(seq(0, 6, by = 0.0002)) %>% matrix(ncol = 1), # Training label = y
verbose = 0,
epochs = 60, batch_size = 64 # Training parameters
)
The comparisons.
tibble(x = seq(0, 6, by = 0.001)) %>%
ggplot() +
geom_line(aes(x = x, y = predict(model_ua, x), color = "Small model")) +
geom_line(aes(x = x, y = predict(model_ua2, x), color = "Large model")) +
stat_function(fun = sin, aes(color = "sin(x) function")) +
scale_color_manual(values = c("#EEAA33", "#3366CC", "#000000")) + theme_bw()
188/188 - 0s - 92ms/epoch - 491us/step
188/188 - 0s - 90ms/epoch - 476us/step